1 Background

We hope to explore the relative influence of physical traits, environmental conditions, and species identity on the growth rate of trees. A gradient boosted model seems like a good candidate for this work.

1.1 Extracting Principal Components for Environmental Traits

We first converted the environmental variables to principal components because they were highly correlated. We visualized the PCA and used the eigenvectors to identify which environmental condition best explained each PC. The resulting components were Soil Fertility, Light, Temperature, pH, Soil.Humidity.Depth, and Slope.
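As a minimal sketch of this step, assuming base R's `prcomp` and simulated stand-in data (the variable names `soil_n`, `soil_p`, and `canopy_light` are hypothetical, not the study's measurements), the loadings in the rotation matrix show which original variables dominate each component:

```r
# Simulated stand-ins for correlated environmental variables (hypothetical names).
set.seed(42)
n <- 200
fertility <- rnorm(n)
env <- data.frame(
  soil_n       = fertility + rnorm(n, sd = 0.3),
  soil_p       = fertility + rnorm(n, sd = 0.3),
  canopy_light = rnorm(n)
)

# Centre and scale so variables measured in different units contribute equally.
pca <- prcomp(env, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Eigenvectors (loadings): the largest absolute loading names the variable
# that contributes most to each PC.
loadings <- pca$rotation
top_var_per_pc <- apply(abs(loadings), 2,
                        function(v) rownames(loadings)[which.max(v)])

# Component scores, which can replace the raw variables as model predictors.
scores <- as.data.frame(pca$x)
```

Here the two correlated soil variables load together on the first component, which is the kind of pattern one would use to give a PC a label such as "Soil Fertility".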

1.1.1 PC1-PC2

1.1.2 PC3-PC4

1.1.3 PC5-PC6

1.1.4 Correlation on Plant Traits

We want to ensure that the plant traits are not correlated. Past work suggests that they are not easily represented using a PCA, so we will not use this feature reduction method for the plant traits.

1.2 About Gradient Boosted Models

A gradient boosted machine/model is a machine learning model that uses decision trees to fit the data.

A decision tree starts with all of the observations and then searches the provided variables for the split that would result in the “purest” groupings of the data. In this case, it would try to place rows with higher growth rates in one node and those with lower growth rates in another node.
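The split search can be sketched for a single numeric predictor. This is an illustration under stated assumptions, not the internals of any particular package: purity is measured here as the within-node sum of squared errors, and `light` and `growth` are simulated, hypothetical variables.

```r
# Find the threshold on one numeric predictor that minimises the
# within-node sum of squared errors (a common purity measure for regression).
best_split <- function(x, y) {
  sse <- function(v) sum((v - mean(v))^2)
  xs <- sort(unique(x))
  # Candidate thresholds: midpoints between consecutive unique values.
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  total <- vapply(cuts,
                  function(cut) sse(y[x <= cut]) + sse(y[x > cut]),
                  numeric(1))
  list(threshold = cuts[which.min(total)], sse = min(total))
}

# Toy data: growth jumps once the light score passes 0.5.
set.seed(1)
light <- runif(100)
growth <- ifelse(light > 0.5, 2, 1) + rnorm(100, sd = 0.1)

found <- best_split(light, growth)
found$threshold  # close to the true change point at 0.5
```

A full tree repeats this search over every predictor and then recurses into each resulting node.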

GBMs are an ensemble of decision trees, but the trees are fit sequentially. We call GBMs an ensemble of weak learners because each subsequent tree attempts to correct the errors of the trees before it. Thus, while one tree by itself cannot describe the relationships, all of the trees together can. Below is a figure by Bradley Boehmke that illustrates how each subsequent tree improves the fit to the data.

Figure: Boosted regression decision stumps as 0-1024 successive trees are added.
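The sequential fitting can be sketched in base R with one-split trees (stumps) and squared-error loss, where the quantity each new tree is fit to is simply the current residual. The function names `fit_stump`, `predict_stump`, and `boost_stumps` are hypothetical helpers written for this illustration; the gbm package itself adds subsampling, deeper trees, and other refinements.

```r
# Fit a one-split tree (stump) to residuals r on predictor x.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  sse <- vapply(cuts, function(cut) {
    left <- r[x <= cut]
    right <- r[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  cut <- cuts[which.min(sse)]
  list(cut = cut, left = mean(r[x <= cut]), right = mean(r[x > cut]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$cut, stump$left, stump$right)
}

# Gradient boosting for squared-error loss: each stump is fit to the
# current residuals and added to the ensemble with a small learning rate.
boost_stumps <- function(x, y, n_trees = 100, learn_rate = 0.1) {
  pred <- rep(mean(y), length(y))  # start from the overall mean
  mse <- numeric(n_trees)
  for (m in seq_len(n_trees)) {
    stump <- fit_stump(x, y - pred)
    pred <- pred + learn_rate * predict_stump(stump, x)
    mse[m] <- mean((y - pred)^2)
  }
  list(pred = pred, mse = mse)
}

set.seed(1)
x <- runif(200, 0, 2 * pi)
y <- sin(x) + rnorm(200, sd = 0.2)
fit <- boost_stumps(x, y)
```

The training error stored in `fit$mse` falls as trees are added, which is the behaviour the figure depicts. Training error alone says nothing about overfitting, which is why a validation set or cross-validation is used to choose the number of trees.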

2 Compare Models

We compared the fit of three gradient boosted models to determine how environmental gradients and physical traits influence RGR:

2.1 Model 1: Tree Age + Plant Trait + Environmental Conditions

2.1.1 Model Parameters

First, we look at the best parameters from tuning.

## $model_id
## [1] "final_grid_model_96"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 6
## 
## $min_rows
## [1] 64
## 
## $nbins
## [1] 256
## 
## $nbins_cats
## [1] 512
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "MSE"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3480.334
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $learn_rate_annealing
## [1] 0.99
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.33
## 
## $col_sample_rate
## [1] 0.42
## 
## $col_sample_rate_per_tree
## [1] 0.2
## 
## $min_split_improvement
## [1] 1e-08
## 
## $histogram_type
## [1] "UniformAdaptive"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $calibration_method
## [1] "PlattScaling"
## 
## $x
##  [1] "Soil.Fertility"     "Light"              "Temperature"        "pH"                
##  [5] "Slope"              "Estem"              "Branching.Distance" "Stem.Wood.Density" 
##  [9] "Leaf.Area"          "LMA"                "LCC"                "LNC"               
## [13] "LPC"                "d15N"               "t.b2"               "Ks"                
## [17] "Ktwig"              "Huber.Value"        "X.Lum"              "VD"                
## [21] "X.Sapwood"          "d13C"               "Tree.Age"          
## 
## $y
## [1] "BAI_GR"

2.1.2 Build Model

Now, we can build the model.

library(dplyr)
library(gbm)

set.seed(123)
gbm_regressor_bai_residuals <-
  gbm(BAI_GR ~ .,
      data =
        rgr_msh_na %>%
        filter(Group == "Train") %>%
        filter(!is.na(BAI_GR)) %>%
        select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep, "Tree.Age", "BAI_GR"))),
      n.trees = 1000,
      interaction.depth = 6,  # max_depth from tuning
      shrinkage = 0.05,       # learn_rate from tuning
      n.minobsinnode = 10,    # minimum observations per terminal node (tuning reported min_rows = 64)
      bag.fraction = 0.33,    # sample_rate from tuning
      verbose = FALSE,
      n.cores = NULL,         # let gbm choose the number of cores
      cv.folds = 5)           # 5-fold cross-validation

2.1.3 Relative Importance

First, we look at the importance of variables in the model.
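One way to obtain variable importance from a gbm fit is the package's `summary()` method, which returns the relative influence of each predictor. The sketch below assumes the gbm package is installed and uses a simulated data set (`toy`, `x1` to `x3` are hypothetical); it is not the chunk that produced the report's figure.

```r
library(gbm)

# Simulated data: x1 matters most, x2 a little, x3 not at all.
set.seed(1)
n <- 500
toy <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
toy$y <- 3 * toy$x1 + toy$x2 + rnorm(n, sd = 0.1)

fit <- gbm(y ~ ., data = toy,
           distribution = "gaussian",
           n.trees = 200,
           interaction.depth = 2,
           shrinkage = 0.05,
           n.minobsinnode = 10,
           bag.fraction = 0.5,
           verbose = FALSE)

# Relative influence per predictor, sorted from most to least influential.
# With the default normalisation the values sum to 100.
imp <- summary(fit, n.trees = 200, plotit = FALSE)
imp
```

Relative influence reflects how much each variable reduced the loss across all splits, so it describes the fitted model rather than a causal effect.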

2.1.4 Partial Dependence

Here we assess the relationship between growth rate and each predictor after averaging over the effects of all the other predictors.
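The calculation behind a partial dependence curve can be sketched by hand: fix the predictor of interest at each value on a grid for every row, predict, and average. The helper `partial_dependence` is hypothetical, the data are simulated, and a linear model stands in for the GBM so the example runs in base R.

```r
# Partial dependence of a fitted model on one predictor: set that predictor
# to each grid value for every row, predict, and average the predictions.
partial_dependence <- function(model, data, var, grid_size = 20) {
  grid <- seq(min(data[[var]]), max(data[[var]]), length.out = grid_size)
  avg_pred <- vapply(grid, function(value) {
    modified <- data
    modified[[var]] <- value
    mean(predict(model, newdata = modified))
  }, numeric(1))
  data.frame(value = grid, prediction = avg_pred)
}

set.seed(1)
toy <- data.frame(light = runif(300), slope = runif(300))
toy$growth <- 2 * toy$light - toy$slope + rnorm(300, sd = 0.1)

model <- lm(growth ~ light + slope, data = toy)  # stand-in for a GBM
pd_light <- partial_dependence(model, toy, "light")
```

Plotting `prediction` against `value` gives the partial dependence curve for that predictor.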

Soil Fertility

Light

Temperature

pH

Slope

Modulus of Elasticity for Stem

Branching Distance

Stem Wood Density

Leaf Area

Leaf Mass Per Area

Leaf Carbon Concentration

Leaf Nitrogen Concentration

Leaf Phosphorus Concentration

Delta 15N

Thickness to Span Ratio

Conductivity Per Sapwood Area

Conductivity per Branch

Huber Value

Percent Lumen

Vessel Diameter

Percent Sapwood

Delta 13C

Tree Age

2.1.5 Performance

How does the model perform when we use the true individual trait value?
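The source does not state which metrics were reported, so the following is only a sketch of common choices for a regression model (RMSE, MAE, and R-squared) computed on held-out observations. The function name and the four example values are made up for illustration.

```r
# Common regression metrics from observed and predicted values.
regression_metrics <- function(observed, predicted) {
  resid <- observed - predicted
  c(
    rmse = sqrt(mean(resid^2)),
    mae  = mean(abs(resid)),
    r2   = 1 - sum(resid^2) / sum((observed - mean(observed))^2)
  )
}

observed  <- c(1.0, 2.0, 3.0, 4.0)
predicted <- c(1.1, 1.9, 3.2, 3.8)
metrics <- regression_metrics(observed, predicted)
metrics
```

For a gbm fit, `predicted` would come from `predict()` on the test rows, which are never used during training or tuning.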

2.1.6 Interactions | Table

Let’s explore the interactions in these data.
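One way to quantify pairwise interactions in a gbm fit is Friedman's H-statistic, available through `interact.gbm()`. Whether the report's interaction values were computed this way is an assumption; the sketch below uses simulated data in which `x1` and `x2` interact by construction (all names are hypothetical).

```r
library(gbm)

# Simulated data with a built-in x1:x2 interaction and an additive x3 effect.
set.seed(1)
n <- 500
toy <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
toy$y <- 4 * toy$x1 * toy$x2 + toy$x3 + rnorm(n, sd = 0.1)

fit <- gbm(y ~ ., data = toy,
           distribution = "gaussian",
           n.trees = 300,
           interaction.depth = 3,  # depth > 1 is needed to model interactions
           shrinkage = 0.05,
           n.minobsinnode = 10,
           bag.fraction = 0.5,
           verbose = FALSE)

# H-statistic for every pair of predictors, sorted from strongest to weakest.
var_pairs <- t(combn(c("x1", "x2", "x3"), 2))
h_table <- data.frame(
  var1 = var_pairs[, 1],
  var2 = var_pairs[, 2],
  H = apply(var_pairs, 1, function(p) {
    interact.gbm(fit, data = toy, i.var = p, n.trees = 300)
  })
)
h_table <- h_table[order(-h_table$H), ]
h_table
```

A table like this can then be filtered with a threshold to select the pairs worth plotting.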

2.1.7 Interaction | Group | Test

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value by Class
## Kruskal-Wallis chi-squared = 19.714, df = 2, p-value = 5.239e-05
##                                                                                  Comparison         Z      P.unadj        P.adj
## 1 Environmental Conditions:Environmental Conditions - Plant Traits:Environmental Conditions -2.744831 6.054205e-03 1.210841e-02
## 2             Environmental Conditions:Environmental Conditions - Plant Traits:Plant Traits -4.630593 3.646202e-06 1.093861e-05
## 3                         Plant Traits:Environmental Conditions - Plant Traits:Plant Traits -2.108346 3.500109e-02 3.500109e-02
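A test of this kind can be sketched in base R with `kruskal.test()`. The data below are simulated and the class labels are shortened, hypothetical stand-ins. For the follow-up comparisons, pairwise Wilcoxon rank-sum tests are swapped in as a base R alternative; this is not necessarily the procedure that produced the table above. The last lines check that the adjusted p-values reported above are consistent with a Holm correction of the unadjusted ones.

```r
# Simulated interaction strengths for three classes of variable pairs.
set.seed(1)
interactions <- data.frame(
  Class = factor(rep(c("Env:Env", "Trait:Env", "Trait:Trait"), each = 30)),
  Value = c(rexp(30, rate = 20), rexp(30, rate = 10), rexp(30, rate = 5))
)

# Omnibus test: do the classes differ in the distribution of Value?
kw <- kruskal.test(Value ~ Class, data = interactions)

# Follow-up pairwise comparisons with Holm-adjusted p-values
# (a stand-in for the pairwise test shown in the report).
pairwise <- pairwise.wilcox.test(interactions$Value, interactions$Class,
                                 p.adjust.method = "holm")

# Holm adjustment applied to the unadjusted p-values reported above.
p_unadj <- c(6.054205e-03, 3.646202e-06, 3.500109e-02)
p_holm <- p.adjust(p_unadj, method = "holm")
p_holm
```

With three groups the omnibus test has two degrees of freedom, matching `df = 2` in the output above.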

2.1.8 Interaction | Group | Violin

2.1.9 Interaction | Group | Boxplot

2.1.10 Interaction | Group | Density

2.1.11 Interaction | Group | Top

2.1.12 Interactions | Plots

Now, we plot interactions with values > 0.10.

Soil Fertility:Conductivity per Branch

Leaf Nitrogen Concentration:Huber Value

Slope:Vessel Diameter

Huber Value:Tree Age

Leaf Phosphorus Concentration:Percent Lumen

Percent Lumen:Percent Sapwood

Leaf Phosphorus Concentration:Tree Age

Conductivity per Branch:Tree Age

Modulus of Elasticity for Stem:Delta 13C

Slope:Tree Age

Delta 15N:Conductivity per Branch

Leaf Nitrogen Concentration:Percent Lumen

Huber Value:Percent Sapwood

Light:Branching Distance

pH:Tree Age

Branching Distance:Percent Lumen

Delta 15N:Thickness to Span Ratio

Slope:Conductivity per Branch

Branching Distance:Leaf Nitrogen Concentration

Leaf Phosphorus Concentration:Thickness to Span Ratio

Leaf Area:Huber Value

Huber Value:Percent Lumen

pH:Conductivity per Branch

Slope:Huber Value

Light:Delta 13C

Temperature:pH

Leaf Carbon Concentration:Huber Value

Slope:Thickness to Span Ratio

Soil Fertility:Light

Light:Conductivity per Branch

Branching Distance:Delta 15N

Temperature:Branching Distance

Temperature:Huber Value

Leaf Phosphorus Concentration:Percent Sapwood

Vessel Diameter:Delta 13C

Modulus of Elasticity for Stem:Leaf Carbon Concentration

Thickness to Span Ratio:Conductivity Per Sapwood Area

Temperature:Leaf Area

Conductivity per Branch:Delta 13C

Leaf Area:Leaf Nitrogen Concentration

Leaf Area:Thickness to Span Ratio

Percent Sapwood:Tree Age

Delta 15N:Huber Value

Leaf Nitrogen Concentration:Percent Sapwood

Slope:Leaf Nitrogen Concentration

Modulus of Elasticity for Stem:Delta 15N

pH:Thickness to Span Ratio

Modulus of Elasticity for Stem:Huber Value

Delta 15N:Delta 13C

Leaf Carbon Concentration:Percent Sapwood

Branching Distance:Leaf Carbon Concentration

Thickness to Span Ratio:Percent Sapwood

Leaf Nitrogen Concentration:Thickness to Span Ratio

Conductivity Per Sapwood Area:Conductivity per Branch

Thickness to Span Ratio:Conductivity per Branch

Light:Leaf Mass Per Area

Thickness to Span Ratio:Huber Value

2.1.13 Group Relative Importance

Finally, we compare the relative importance of the predictor groups: tree age, plant traits, and environmental conditions.
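Group-level importance can be sketched by summing per-variable relative influence within each group. The variable names below come from the model's predictor list, but the `rel.inf` values are made up purely for illustration, and the `group_lookup` mapping is an assumption about how variables would be assigned to groups.

```r
# Per-variable relative influence (made-up values for illustration only).
importance <- data.frame(
  var = c("Soil.Fertility", "Light", "LMA", "Huber.Value", "Tree.Age"),
  rel.inf = c(10, 15, 30, 25, 20)
)

# Hypothetical mapping from variable to predictor group.
group_lookup <- c(
  Soil.Fertility = "Environmental Conditions",
  Light          = "Environmental Conditions",
  LMA            = "Plant Traits",
  Huber.Value    = "Plant Traits",
  Tree.Age       = "Tree Age"
)
importance$group <- unname(group_lookup[importance$var])

# Total relative influence per group.
group_importance <- aggregate(rel.inf ~ group, data = importance, FUN = sum)

# Groups differ in size, so the mean per variable is a useful companion.
n_vars <- table(importance$group)
group_importance$mean_per_var <-
  group_importance$rel.inf / as.vector(n_vars[group_importance$group])
group_importance
```

Reporting both the total and the per-variable mean avoids a large group appearing important only because it contains more predictors.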

2.2 Model 2: Tree Age + Species Identity + Environmental Conditions

2.2.1 Model Parameters

First, we look at the best parameters from tuning.

## $model_id
## [1] "final_grid_model_56"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 3
## 
## $min_rows
## [1] 1
## 
## $nbins
## [1] 512
## 
## $nbins_cats
## [1] 512
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "MSE"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3585.667
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $learn_rate_annealing
## [1] 0.99
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.58
## 
## $col_sample_rate
## [1] 0.74
## 
## $col_sample_rate_per_tree
## [1] 0.76
## 
## $min_split_improvement
## [1] 0
## 
## $histogram_type
## [1] "UniformAdaptive"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $calibration_method
## [1] "PlattScaling"
## 
## $x
## [1] "Soil.Fertility" "Light"          "Temperature"    "pH"             "Slope"         
## [6] "Species"        "Tree.Age"      
## 
## $y
## [1] "BAI_GR"

2.2.2 Build Model

Now, we can build the model.

set.seed(123)
gbm_regressor_bai_residuals_species <-
  gbm(BAI_GR ~ .,
      data =
        rgr_msh_na %>%
        filter(Group == "Train") %>%
        filter(!is.na(BAI_GR)) %>%
        select(any_of(c(EnvironmentalVariablesKeep, "Species", "Tree.Age", "BAI_GR"))) %>%
        mutate(Species = factor(Species)),  # gbm needs a factor, not character
      n.trees = 1000,
      interaction.depth = 3,  # max_depth from tuning
      shrinkage = 0.05,       # learn_rate from tuning
      n.minobsinnode = 5,     # minimum observations per terminal node (tuning reported min_rows = 1)
      bag.fraction = 0.58,    # sample_rate from tuning
      verbose = FALSE,
      n.cores = NULL,         # let gbm choose the number of cores
      cv.folds = 5)           # 5-fold cross-validation

2.2.3 Relative Importance

First, we look at the importance of variables in the model.

2.2.4 Partial Dependence

Here we assess the relationship between growth rate and each predictor after averaging over the effects of all the other predictors.

Soil Fertility

Light

Temperature

pH

Slope

Species

Tree Age

2.2.5 Performance

How does the model perform?

2.2.6 Interactions | Table

Let’s explore the interactions in these data.

2.2.7 Interaction | Group | Test

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value by Class
## Kruskal-Wallis chi-squared = 0.38788, df = 1, p-value = 0.5334

2.2.8 Interaction | Group | Violin

2.2.9 Interaction | Group | Boxplot

2.2.10 Interaction | Group | Density

2.2.11 Interactions | Plots

Now, we plot interactions with values > 0.10.

Species:Tree Age

Temperature:Species

Species:Slope

Light:Species

pH:Species

Soil Fertility:Species

Light:Tree Age

pH:Slope

Soil Fertility:Light

2.2.12 Group Relative Importance

Finally, we compare the relative importance of the predictor groups: tree age, species identity, and environmental conditions.

2.3 Model 3: Tree Age + Species Identity + Plant Trait + Environmental Conditions

2.3.1 Model Parameters

First, we look at the best parameters from tuning.

## $model_id
## [1] "final_grid_model_92"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 12
## 
## $min_rows
## [1] 4
## 
## $nbins
## [1] 128
## 
## $nbins_cats
## [1] 2048
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "MSE"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3526.383
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $learn_rate_annealing
## [1] 0.99
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.54
## 
## $col_sample_rate
## [1] 0.39
## 
## $col_sample_rate_per_tree
## [1] 0.76
## 
## $min_split_improvement
## [1] 0
## 
## $histogram_type
## [1] "UniformAdaptive"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $calibration_method
## [1] "PlattScaling"
## 
## $x
##  [1] "Soil.Fertility"     "Light"              "Temperature"        "pH"                
##  [5] "Slope"              "Estem"              "Branching.Distance" "Stem.Wood.Density" 
##  [9] "Leaf.Area"          "LMA"                "LCC"                "LNC"               
## [13] "LPC"                "d15N"               "t.b2"               "Ks"                
## [17] "Ktwig"              "Huber.Value"        "X.Lum"              "VD"                
## [21] "X.Sapwood"          "d13C"               "Species"            "Tree.Age"          
## 
## $y
## [1] "BAI_GR"

2.3.2 Build Model

Now, we can build the model.

set.seed(123)
gbm_regressor_baiSpeciesAgeEP <-
  gbm(BAI_GR ~ .,
      data =
        rgr_msh_na %>%
        filter(Group == "Train") %>%
        filter(!is.na(BAI_GR)) %>%
        select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep, "Species",
                        "Tree.Age", "BAI_GR"))) %>%
        mutate(Species = factor(Species)),  # gbm needs a factor, not character
      n.trees = 1000,
      interaction.depth = 12,  # max_depth from tuning
      shrinkage = 0.05,        # learn_rate from tuning
      n.minobsinnode = 10,     # minimum observations per terminal node (tuning reported min_rows = 4)
      bag.fraction = 0.54,     # sample_rate from tuning
      verbose = FALSE,
      n.cores = NULL,          # let gbm choose the number of cores
      cv.folds = 5)            # 5-fold cross-validation

2.3.3 Relative Importance

First, we look at the importance of variables in the model.

2.3.4 Partial Dependence

Here we assess the relationship between growth rate and each predictor after averaging over the effects of all the other predictors.

Soil Fertility

Light

Temperature

pH

Slope

Modulus of Elasticity for Stem

Branching Distance

Stem Wood Density

Leaf Area

Leaf Mass Per Area

Leaf Carbon Concentration

Leaf Nitrogen Concentration

Leaf Phosphorus Concentration

Delta 15N

Thickness to Span Ratio

Conductivity Per Sapwood Area

Conductivity per Branch

Huber Value

Percent Lumen

Vessel Diameter

Percent Sapwood

Delta 13C

Species

Tree Age

2.3.5 Performance

How does the model perform?

2.3.6 Interactions | Table

Let’s explore the interactions in these data.

2.3.7 Interaction | Group | Test

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value by Class
## Kruskal-Wallis chi-squared = 16.115, df = 3, p-value = 0.001074

2.3.8 Interaction | Group | Violin

2.3.9 Interaction | Group | Boxplot

2.3.10 Interaction | Group | Density

2.3.11 Interactions | Plots

Now, we plot interactions with values > 0.10.

Leaf Mass Per Area:Species

Delta 15N:Species

Branching Distance:Leaf Phosphorus Concentration

Delta 13C:Species

Delta 15N:Delta 13C

Species:Tree Age

Slope:Species

Branching Distance:Species

Light:Species

Soil Fertility:Branching Distance

pH:Species

Percent Lumen:Species

Vessel Diameter:Species

Leaf Phosphorus Concentration:Percent Lumen

Light:Branching Distance

2.3.12 Group Relative Importance

Finally, we compare the relative importance of the predictor groups: tree age, species identity, plant traits, and environmental conditions.